Hypothesis

Training one logistic regression model per folder and combining them will classify messages more accurately than a single multiclass logistic regression model.


In [1]:
# Load data
import pandas as pd
with open('./data_files/8lWZYw-u-yNbGBkC4B--ip77K1oVwwyZTHKLeD7rm7k.csv') as data_file:
    df = pd.read_csv(data_file)
df.head()


Out[1]:
Subject Id ConversationId Importance SentDateTime Body CcRecipients Sender ToRecipients FolderId
0 OPGIdentity Requests Quarterly Review and 6 ot... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-27T11:06:17Z Logocidagendaicon Your agenda for Mond... NaN no-reply@microsoft.com dastrock@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...
1 Throttling alerts from Gateway AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-21T20:25:49Z The monitor that will create these alerts has ... NaN akina@microsoft.com msodsswat@microsoft.com;estsincident@microsoft... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...
2 App API Scrum Monday Series and 4 other event... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-20T11:11:30Z Logocidagendaicon Your agenda for Mond... NaN no-reply@microsoft.com dastrock@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...
3 Throttling alerts from Gateway AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-17T22:39:56Z Description Description Description... NaN akina@microsoft.com msodsswat@microsoft.com;estsincident@microsoft... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...
4 Notification AAD certificate roll in March for... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-15T19:15:21Z cidimage001png01D16D8184C55A30cidimage003j... NaN shiung.yong@microsoft.com aadpartnersnotify@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...

Building an Ensemble Classifier


  • Do some preprocessing on the text columns (subject, body, maybe to, cc, from)
    • Clean NaNs or remove rows of data with NaNs
    • Replicate what the Azure ML Preprocess Text module does for us (stop words, etc.)
    • Use scikit-learn where possible
  • Do some feature construction using pandas & scikit-learn
    • On subject, body, to, cc, from, etc
    • Feature Hashing
    • TF/IDF
    • Custom TF/IDF (per-folder)
    • One-Hot Encoding (get_dummies)
  • One-Hot Encode FolderId labels into their own boolean columns (1s & 0s)
  • Split data into training & test sets to be used for all ensemble members
  • For each folder, train a model on the training data
    • Probably use logistic regression to start out
    • Consider decision trees, SVMs, or other classifier models
    • Use subject, body, to, cc, from, etc as features
    • Use FolderId boolean column as label (yes/no)
    • Save each model for making predictions
  • Construct ensemble classifier from N folder models
    • For a query message, make N predictions (one per model)
    • Output probabilities/confidences
    • Ensemble prediction is the most confident per-folder prediction
    • Construct a composite probability & output with prediction result
  • Evaluate performance of model on test data, compare to Azure ML models
    • Compare to the out-of-the-box scikit-learn logistic regression model
  • Figure out how to persist models in NDB or Cloud Storage for making runtime predictions
  • Figure out how to perform preprocessing & feature construction at runtime
  • Construct REST API for serving predictions
  • Figure out how to deploy this all into production
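
One possible way to approach the persistence and runtime feature-construction items in the plan above is to bundle the vectorizer and the classifier into a single scikit-learn Pipeline and serialize the whole thing. This is only a sketch under assumptions: the toy subjects and folder names are hypothetical, and the pickled bytes stand in for whatever storage (NDB, Cloud Storage) ends up being used.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real Subject / FolderId columns
subjects = ['Throttling alerts from Gateway', 'Your agenda for Monday',
            'Throttling alerts resolved', 'Your agenda for Tuesday']
folders = ['alerts', 'agenda', 'alerts', 'agenda']

# Preprocessing and model travel together, so prediction-time features
# are built with exactly the vocabulary learned during training
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', lowercase=True)),
    ('clf', LogisticRegression()),
])
pipeline.fit(subjects, folders)

# Serialize to bytes (could be written to a blob store) and restore
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

prediction = restored.predict(['Throttling alerts from Gateway again'])
print(prediction)
```

Pickles are tied to the library versions that produced them, so the serving environment would need to match the training environment.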

Constructing Subject Feature Matrix



In [2]:
# Remove messages without a Subject
print(df.shape)
df = df.dropna(subset=['Subject'])
print(df.shape)


(10301, 10)
(10295, 10)

In [3]:
# Perform bag-of-words feature extraction
# TODO: Need to train with a fixed vocabulary and reuse it at prediction time,
# otherwise runtime feature construction won't produce matching columns
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(stop_words='english', lowercase=True)
train_counts = count_vect.fit_transform(df['Subject'])
print('Dimensions of vocabulary feature matrix are:')
print(train_counts.shape)


Dimensions of vocabulary feature matrix are:
(10295, 3119)
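
The TODO about a fixed vocabulary could be handled roughly as follows. This is a minimal sketch, assuming the fitted vocabulary is saved somewhere after training; the example subjects and the names `fitted_vect`, `saved_vocabulary`, and `runtime_vect` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training subjects
train_subjects = ['Throttling alerts from Gateway', 'Quarterly review agenda']
fitted_vect = CountVectorizer(stop_words='english', lowercase=True)
train_matrix = fitted_vect.fit_transform(train_subjects)

# vocabulary_ maps each token to its column index; this is what would be persisted
saved_vocabulary = fitted_vect.vocabulary_

# At prediction time, rebuild a vectorizer around the saved vocabulary.
# Tokens that were never seen in training are silently ignored.
runtime_vect = CountVectorizer(stop_words='english', lowercase=True,
                               vocabulary=saved_vocabulary)
new_matrix = runtime_vect.transform(['Gateway alerts and unseen words'])

print(train_matrix.shape, new_matrix.shape)
```

Both matrices have the same number of columns, which is what a trained model needs in order to score new messages.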

Strategies for reducing # of columns in feature matrix

  • Add more stop words
  • Remove email addresses
  • Remove URLs
  • Lemmatization
  • Remove numbers, special characters, sequences of characters like 'aaaaa'
  • Perform manual tokenization to get column names, and inspect types of cols created
  • ...
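
A few of these strategies can be sketched with a custom preprocessor passed to CountVectorizer. The regular expressions, the `clean_subject` helper, and the sample subjects are assumptions for illustration and have not been tuned against the real data.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

URL_RE = re.compile(r'https?://\S+')
EMAIL_RE = re.compile(r'\S+@\S+')
REPEAT_RE = re.compile(r'\b(\w)\1{3,}\b')  # runs of one character, e.g. 'aaaaa'

def clean_subject(text):
    """Lowercase and blank out URLs, email addresses, and repeated-character runs."""
    text = text.lower()
    text = URL_RE.sub(' ', text)
    text = EMAIL_RE.sub(' ', text)
    text = REPEAT_RE.sub(' ', text)
    return text

# The token pattern keeps purely alphabetic tokens of 2+ letters,
# which drops numbers and tokens that mix letters and digits
vect = CountVectorizer(preprocessor=clean_subject,
                       stop_words='english',
                       token_pattern=r'(?u)\b[a-z][a-z]+\b')

# Hypothetical subjects
samples = ['Re: aaaaa see http://example.com/x from bob@example.com ticket 12345 alerts',
           'Gateway alerts 2017']
matrix = vect.fit_transform(samples)
vocabulary = sorted(vect.vocabulary_)
print(vocabulary)
```

Setting `min_df` on the vectorizer to drop tokens that appear in very few messages would be another option, at the cost of losing rare but possibly informative words.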

In [4]:
# Add TF/IDF weighting to account for length of documents
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)
print('Dimensions of vocabulary feature matrix are:')
print(train_tfidf.shape)
print('But, its a sparse matrix: ' + str(type(train_tfidf)))


Dimensions of vocabulary feature matrix are:
(10295, 3119)
But, its a sparse matrix: <class 'scipy.sparse.csr.csr_matrix'>
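
To make the weighting concrete, the sketch below recomputes TF/IDF by hand for a tiny count matrix and compares it with TfidfTransformer. It assumes scikit-learn's default settings (smoothed IDF and L2 row normalization); the count matrix is made up.

```python
import numpy as np
import scipy.sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term counts: 3 documents x 3 terms
counts = np.array([[2, 1, 0],
                   [0, 1, 1],
                   [1, 0, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

# Weight the raw counts, then scale every row to unit L2 length
# (the row scaling is what compensates for document length)
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

library = TfidfTransformer().fit_transform(scipy.sparse.csr_matrix(counts)).toarray()
print(np.allclose(manual, library))
```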

Constructing CC, To, and From Feature Matrix



In [5]:
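# Merge CC, To, From into one People column
# Assign the filled columns back instead of calling fillna(inplace=True) on a
# column selection, which may operate on a temporary copy
df['CcRecipients'] = df['CcRecipients'].fillna('')
df['ToRecipients'] = df['ToRecipients'].fillna('')
df['Sender'] = df['Sender'].fillna('')
df['People'] = df['Sender'] + ';' + df['CcRecipients'] + ';' + df['ToRecipients']
df.head(10)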


Out[5]:
Subject Id ConversationId Importance SentDateTime Body CcRecipients Sender ToRecipients FolderId People
0 OPGIdentity Requests Quarterly Review and 6 ot... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-27T11:06:17Z Logocidagendaicon Your agenda for Mond... no-reply@microsoft.com dastrock@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... no-reply@microsoft.com;;dastrock@microsoft.com
1 Throttling alerts from Gateway AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-21T20:25:49Z The monitor that will create these alerts has ... akina@microsoft.com msodsswat@microsoft.com;estsincident@microsoft... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... akina@microsoft.com;;msodsswat@microsoft.com;e...
2 App API Scrum Monday Series and 4 other event... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-20T11:11:30Z Logocidagendaicon Your agenda for Mond... no-reply@microsoft.com dastrock@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... no-reply@microsoft.com;;dastrock@microsoft.com
3 Throttling alerts from Gateway AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-17T22:39:56Z Description Description Description... akina@microsoft.com msodsswat@microsoft.com;estsincident@microsoft... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... akina@microsoft.com;;msodsswat@microsoft.com;e...
4 Notification AAD certificate roll in March for... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-15T19:15:21Z cidimage001png01D16D8184C55A30cidimage003j... shiung.yong@microsoft.com aadpartnersnotify@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... shiung.yong@microsoft.com;;aadpartnersnotify@m...
5 OpenID RP Certification Launch Announcement AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-14T22:40:30Z Part of the OpenID Foundation efforts to conti... oauth@microsoft.com;catpm@microsoft.com;mssts@... michael.jones@microsoft.com openid@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... michael.jones@microsoft.com;oauth@microsoft.co...
6 How to determine if an account is fully provis... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2016-09-06T21:31:42Z Hey guys I am Anbin from OneNote team I am ... pthiruv@exchange.microsoft.com;wenjenc@microso... anbinm@microsoft.com msareq@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... anbinm@microsoft.com;pthiruv@exchange.microsof...
7 Registering map platform component as an app f... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2016-09-01T02:04:17Z Hey MsaReq I’m on the maps platform team an... icheck@microsoft.com msareq@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... icheck@microsoft.com;;msareq@microsoft.com
8 Accepted Fixing MSA Developer Requests AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2016-08-10T19:48:31Z wibartle@microsoft.com dastrock@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... wibartle@microsoft.com;;dastrock@microsoft.com
9 Accepted Fixing MSA Developer Requests AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2016-08-10T19:47:41Z adfrei@microsoft.com dastrock@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... adfrei@microsoft.com;;dastrock@microsoft.com

In [6]:
# Convert People to matrix representation
people_features = df['People'].str.get_dummies(sep=';')
print(people_features.shape)
people_features.head()


(10295, 3530)
Out[6]:
11franklinc@gmail.com _ram@microsoft.com a-amgeo@microsoft.com a-asokuy@microsoft.com a-barak@microsoft.com a-bewhi@microsoft.com a-libren@microsoft.com a-markr@microsoft.com a-midumi@microsoft.com a-pakhar@microsoft.com ... zideng@microsoft.com zihliu@microsoft.com zion.brewer@microsoft.com zizhong@microsoft.com zlatkom@exchange.microsoft.com zoinertejada@solliance.net zoltanp@exchange.microsoft.com zorauf@microsoft.com zsolt.zombik@zsoltzombik.com zunqwang@microsoft.com
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 3530 columns


In [7]:
# Will need to store people vocabulary for feature construction during predictions
people_vocabulary = people_features.columns
print(people_vocabulary[:2])
print(len(people_vocabulary))


Index([u'11franklinc@gmail.com', u'_ram@microsoft.com'], dtype='object')
3530
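
One way the stored people vocabulary might be used at prediction time is to build dummies for the new message and then reindex onto the training columns. A minimal sketch, with made-up addresses and hypothetical names (`people_vocab`, `new_features`):

```python
import pandas as pd

# Hypothetical training data in the same 'Sender;Cc;To' layout as the People column
train_people = pd.Series(['a@example.com;;b@example.com',
                          'c@example.com;b@example.com;a@example.com'])
train_features = train_people.str.get_dummies(sep=';')
people_vocab = train_features.columns  # this is what would be stored

# A new message mentions one known address and one unknown address
new_people = pd.Series(['b@example.com;;z@example.com'])
new_features = (new_people.str.get_dummies(sep=';')
                .reindex(columns=people_vocab, fill_value=0))

print(new_features)
```

Reindexing drops addresses that were never seen in training and fills missing known addresses with 0, so the column layout matches the training matrix.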

In [8]:
# Convert to csr_matrix so it can be hstacked with the Subject feature matrix
import scipy.sparse  # a bare "import scipy" does not reliably load the sparse submodule
sparse_people_features = scipy.sparse.csr_matrix(people_features)
print(people_features.shape)
print(sparse_people_features.shape)


(10295, 3530)
(10295, 3530)

In [9]:
print(sparse_people_features.shape)
print(train_tfidf.shape)
feature_matrix = scipy.sparse.hstack([sparse_people_features, train_tfidf])
print(feature_matrix.shape)


(10295, 3530)
(10295, 3119)
(10295, 6649)
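
A small note on the stacking step: the sparse format that `hstack` returns by default depends on the inputs and the SciPy version, so it can be worth requesting CSR explicitly when row slicing is needed later. A minimal sketch with made-up matrices:

```python
import numpy as np
import scipy.sparse

# Hypothetical stand-ins for the people matrix and the TF/IDF matrix
left = scipy.sparse.csr_matrix(np.array([[1, 0], [0, 1]]))
right = scipy.sparse.csr_matrix(np.array([[0.5, 0.0, 0.2], [0.0, 0.3, 0.0]]))

# format='csr' makes the result row-sliceable regardless of the default
combined = scipy.sparse.hstack([left, right], format='csr')
first_row = combined[0].toarray()

print(combined.shape, combined.format)
```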

In [10]:
# Now let's one-hot encode labels to perform binary classification
label_matrix = pd.get_dummies(df['FolderId'])
print(label_matrix.shape)
label_matrix.head()


(10295, 39)
Out[10]:
AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQALOo6CFxH4Rb3A38IMpKY5AAAFEg0eAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQALOo6CFxH4Rb3A38IMpKY5AAAFEgk0AAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQALOo6CFxH4Rb3A38IMpKY5AAAKOdYsAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQALOo6CFxH4Rb3A38IMpKY5AAAKOdYuAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQALOo6CFxH4Rb3A38IMpKY5AAAqdP4WAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQALOo6CFxH4Rb3A38IMpKY5AABNVymLAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQALOo6CFxH4Rb3A38IMpKY5AABqh4nMAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQALOo6CFxH4Rb3A38IMpKY5AABud1doAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQALOo6CFxH4Rb3A38IMpKY5AADNL6AXAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQALOo6CFxH4Rb3A38IMpKY5AAG0-IlaAAA= ... 
AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQBMQGTIN_SLRZ_wsASPVpOdAIWVwTtLAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQBMQGTIN_SLRZ_wsASPVpOdAIWVwTtMAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQBMQGTIN_SLRZ_wsASPVpOdAIWVwTtmAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQBMQGTIN_SLRZ_wsASPVpOdAIWVwTuSAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQBMQGTIN_SLRZ_wsASPVpOdAIWVwTueAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQBMQGTIN_SLRZ_wsASPVpOdAIWVwTvJAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQBMQGTIN_SLRZ_wsASPVpOdAIWVwTvWAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQDXaLnF9Mr4Qa3bv2_rR6E8AAAAHe9MAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQDXaLnF9Mr4Qa3bv2_rR6E8AAAAHsChAAA= AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQDXaLnF9Mr4Qa3bv2_rR6E8AAABOUJhAAA=
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 39 columns

Train two models & compare accuracies



In [29]:
# Split into test and training data sets
from sklearn.model_selection import train_test_split
(labels_train, labels_test,
 features_train, features_test,
 binary_labels_train, binary_labels_test) = train_test_split(
    df['FolderId'], feature_matrix, label_matrix, test_size=0.20, random_state=42)
print(labels_train.shape)
print(labels_test.shape)
print(features_train.shape)
print(features_test.shape)
print(binary_labels_train.shape)
print(binary_labels_test.shape)


(8236,)
(2059,)
(8236, 6649)
(2059, 6649)
(8236, 39)
(2059, 39)

In [30]:
# Train a default Logistic Regression model, with no tuning
from sklearn.linear_model import LogisticRegression
default_lgr_model = LogisticRegression().fit(features_train, labels_train)

In [31]:
# Evaluate default Logistic Regression model on test data
default_lgr_predictions = default_lgr_model.predict(features_test)
from sklearn import metrics
print(metrics.accuracy_score(labels_test, default_lgr_predictions))
# print(np.mean(default_lgr_predictions == labels_test))
print(metrics.confusion_matrix(labels_test, default_lgr_predictions))
# print(metrics.classification_report(labels_test, default_lgr_predictions))


0.936376881982
[[  0   0   0 ...,   0   0   1]
 [  0  81   1 ...,   0   3   0]
 [  0   1 114 ...,   0   0   0]
 ..., 
 [  0   0   0 ...,   0   0   0]
 [  0   4   0 ...,   0  16   0]
 [  0   0   0 ...,   0   0  31]]
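
With 39 folders of very different sizes, a single accuracy number can hide folders that are rarely or never predicted correctly. Per-class recall is one way to surface that. The labels below are made up purely to show the effect.

```python
from sklearn import metrics

# Hypothetical labels: 'a' is a large folder, 'c' is a tiny one
y_true = ['a', 'a', 'a', 'a', 'b', 'b', 'c']
y_pred = ['a', 'a', 'a', 'a', 'b', 'a', 'a']

accuracy = metrics.accuracy_score(y_true, y_pred)

# average=None returns one recall value per label, in the order given
per_class_recall = metrics.recall_score(y_true, y_pred,
                                        labels=['a', 'b', 'c'], average=None)

print(accuracy)
print(per_class_recall)
```

Here the overall accuracy looks reasonable even though folder 'c' is never predicted.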

In [32]:
import numpy as np
from sklearn.linear_model import LogisticRegression
class Folder_Ensemble_Classifier(object):
    """One binary logistic regression per folder; predicts the most confident folder."""

    def __init__(self):
        # Instance-level state, so separate classifiers never share the same lists
        self._folder_models = []
        self._model_class_labels = []

    def fit(self, training_feature_matrix, label_matrix):
        # label_matrix: one-hot DataFrame with one column per folder
        self._folder_models = []
        self._model_class_labels = []
        for folder in label_matrix.columns:
            self._folder_models.append(LogisticRegression().fit(training_feature_matrix, label_matrix[folder]))
            self._model_class_labels.append(folder)
        return self

    # TODO: This needs to work on arrays
    def predict(self, input_feature_matrix):
        # One array of P(message belongs to folder) per model: models x samples
        model_predictions = []
        for model in self._folder_models:
            model_predictions.append(model.predict_proba(input_feature_matrix)[:, 1])

        # Highest probability seen so far and its folder label, one entry per sample
        n_samples = input_feature_matrix.shape[0]
        best_predictions = pd.Series(np.zeros(n_samples))
        best_predictions_labels = pd.Series([''] * n_samples, dtype=object)

        # Earlier numpy attempts (np.place, boolean-mask assignment) failed because the
        # comparison collapsed to a single bool; using pandas Series seems to work better
        for i in range(len(model_predictions)):
            model_vals = pd.Series(model_predictions[i])
            is_better = model_vals > best_predictions
            best_predictions_labels[is_better] = self._model_class_labels[i]
            best_predictions[is_better] = model_vals

        # TODO: Should I generate a composite/average/relative probability?
        d = {'predictions' : best_predictions_labels,
             'probabilities' : best_predictions}
        return pd.DataFrame(d)
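
The two TODOs in the class above (working on arrays, and producing a relative probability) could be approached with a column-stacked score matrix and `argmax`. This is a sketch on made-up 2-D data, not a drop-in replacement; `fit_per_class` and `predict_most_confident` are hypothetical helper names. It also compares the result with scikit-learn's `OneVsRestClassifier`, which implements the same one-model-per-class idea.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def fit_per_class(features, labels):
    """Train one binary model per class (class vs. everything else)."""
    classes = np.unique(labels)
    models = [LogisticRegression().fit(features, labels == c) for c in classes]
    return classes, models

def predict_most_confident(classes, models, features):
    # Column j holds P(class j) from the j-th binary model: samples x models
    scores = np.column_stack([m.predict_proba(features)[:, 1] for m in models])
    best = scores.argmax(axis=1)
    # Relative probability: each model's score divided by the row total
    relative = scores / scores.sum(axis=1, keepdims=True)
    return classes[best], scores.max(axis=1), relative

# Hypothetical data: three well-separated clusters standing in for folders
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
features = np.vstack([c + 0.5 * rng.randn(30, 2) for c in centers])
labels = np.repeat(np.array(['inbox', 'alerts', 'agenda']), 30)

classes, models = fit_per_class(features, labels)
predicted, confidence, relative = predict_most_confident(classes, models, features)

# Built-in one-vs-rest wrapper around the same base model
ovr_predicted = OneVsRestClassifier(LogisticRegression()).fit(features, labels).predict(features)

print((predicted == labels).mean())
print(np.array_equal(predicted, ovr_predicted))
```

The relative probabilities are only a simple normalization of the per-model scores; they are not calibrated multiclass probabilities.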

In [33]:
# Train ensemble model
ensemble_clf = Folder_Ensemble_Classifier().fit(features_train, binary_labels_train)

In [34]:
# Make predictions & evaluate
ensemble_lgr_predictions = ensemble_clf.predict(features_test)
ensemble_lgr_predictions.head()


Out[34]:
predictions probabilities
0 AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... 0.939439
1 AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... 0.992206
2 AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... 0.674886
3 AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... 0.986488
4 AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... 0.820690

In [35]:
# Evaluate model against test data
from sklearn import metrics
print(metrics.accuracy_score(labels_test, ensemble_lgr_predictions['predictions']))
print(metrics.confusion_matrix(labels_test, ensemble_lgr_predictions['predictions']))


0.936376881982
[[  0   0   0 ...,   0   0   1]
 [  0  81   1 ...,   0   3   0]
 [  0   1 114 ...,   0   0   0]
 ..., 
 [  0   0   0 ...,   0   0   0]
 [  0   4   0 ...,   0  16   0]
 [  0   0   0 ...,   0   0  31]]

Conclusions

  • Hey, this scikit-learn thing performed pretty well: about 93.6% accuracy on the test set with no tuning
  • Ha, the custom ensemble model did exactly what the out-of-the-box logistic regression model did, namely one-vs-all; both produced the same accuracy and the same confusion matrix
  • Hypothesis that training per-folder models will help: failed
  • Hey, increasing the size of the data set 5x upped the accuracy by 4%
  • Training on 2/3 of the data instead of 4/5 dropped the accuracy by 1%
  • New hypothesis: When you think about decision boundaries, maybe linear models aren't well suited for this task. I want the likelihood of a given class to not necessarily increase as a certain feature increases. It is the combination of certain ranges of values that indicates a particular class.
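
The intuition behind the new hypothesis can be illustrated on synthetic data where the class depends on a combination of two features rather than on either one alone. This is a toy sketch, not evidence about the email data; the data set and the choice of a decision tree as the non-linear model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR-like data: the class is 1 when both features share the same sign
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A single linear boundary cannot separate opposite quadrants
linear_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# A tree can carve out ranges of both features
tree_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)

print(linear_acc, tree_acc)
```

Whether the email features behave like this is exactly what the new hypothesis would need to test, for example by training a tree-based model on the same feature matrix.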
